Tafazzin is a mitochondrial acyltransferase responsible for remodeling cardiolipin, the signature phospholipid of the inner mitochondrial membrane. It carries out the final remodeling step that produces mature cardiolipin with four linoleoyl tails (tetralinoleoyl cardiolipin), which contributes to the structure and function of mitochondria. Defects in tafazzin cause Barth syndrome, which is characterized by cardiomyopathy, muscle weakness, and low white blood cell counts. Predicted structures of human tafazzin include an alpha-helical transmembrane anchor, a catalytic site for acyltransferase activity, and a positively charged membrane-associated region (Hijikata et al., 2015). Although the structure of tafazzin itself still needs more study, studies of other acyltransferases have identified essential histidine and arginine residues, as well as a highly conserved seven-amino-acid sequence, in the catalytic site (Heath et al., 1998; Dircks et al., 1999).

Scientific Hypothesis: If tafazzin is a phospholipid acyltransferase that is integral to mitochondrial function, then the important functional domains of tafazzin, such as the acyltransferase domain, must be highly conserved across species and similar to other phospholipid acyltransferases.

Bioinformatics Analysis

Multiple sequence alignment was the first bioinformatics analysis. Tafazzin protein sequences from multiple species were aligned together with other acyltransferase sequences from humans. The alignment shows how well the tafazzin sequence is conserved across species, and comparing the conserved regions with the other acyltransferase sequences helps identify whether they share similar functional domains.

Homology modeling and structural bioinformatics were the second bioinformatics analysis. This method creates a 3D model of the predicted protein structure. Comparing the structures of the different tafazzin orthologs shows whether similar sequences lead to similar structures and whether the structure is still conserved where the sequence is not. Additionally, comparing the structure of tafazzin with other mitochondrial acyltransferases may reveal shared structural features that are important for acyltransferase activity.

Visualization Analysis

Phylogenetic clustering is the first visualization analysis. It uses a tree diagram to show how closely related the protein sequences are and where they branch off from one another. Together with the alignment, it may help explain which proteins and domains are more conserved than others, giving a better idea of how they evolved and what was conserved along the way.

3D protein measurement is the second visualization analysis. It reports properties of the protein and its residues, such as which residues fall in which domain or secondary structure and how hydrophobic they are (hydropathy index). Distances and angles between atoms can also be calculated. Looking at these properties for each residue shows what types of residues are conserved in the structure and what those residues contribute to it.

Downloading Data

The data for this analysis can be found in NCBI (https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=6901) and UniProt (https://www.uniprot.org/uniprot/?query=tafazzin&sort=score). On the NCBI page, orthologs and homologs can be found by scrolling down to the "General Gene Information" tab and checking under "homology". Data was downloaded from NCBI as fasta files and compiled into one fasta file, and data from UniProt was downloaded as separate PDB files.

Loading in Packages

Bio:

Bio is the import name for Biopython, a package that contains many modules used for biological analysis. Two of these modules are SeqIO and AlignIO: SeqIO reads and writes individual sequence records, while AlignIO reads and writes multiple sequence alignments. Both accept a variety of file formats (such as fasta, clustal, or phylip) and load the data into Biopython so it can be used and analyzed. Sequences can either be in separate files or all compiled into one. Another module is Seq, which provides the Seq object for creating sequences from strings rather than files. Seq objects can be transcribed, translated, and manipulated in other ways that mimic biological processes. More information on Biopython and the mentioned modules can be found here: https://biopython.org/wiki/Getting_Started, https://biopython.org/wiki/AlignIO, https://biopython.org/wiki/SeqIO, https://biopython.org/docs/1.75/api/Bio.Seq.html
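As a rough illustration of these modules, the sketch below parses FASTA text with SeqIO and manipulates a Seq object. It assumes Biopython is installed. The two short protein records and the DNA string are made-up examples, not tafazzin data, and the text is read from memory through StringIO only so the example does not depend on a file.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq

# Hypothetical two-record FASTA text standing in for a downloaded file.
fasta_text = """>seq_a hypothetical record
MPLHVKWPFP
>seq_b hypothetical record
MPLHVRWPF
"""

# SeqIO.parse yields one SeqRecord per FASTA entry.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
record_lengths = {record.id: len(record.seq) for record in records}

# Seq objects are built from strings and support biological operations.
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")
messenger_rna = coding_dna.transcribe()
protein = coding_dna.translate(to_stop=True)

print(record_lengths)
print(messenger_rna)
print(protein)
```

For a file on disk, the same SeqIO.parse call accepts a file path in place of the StringIO handle.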

os:

os is a module in Python's standard library that lets code running in the terminal or Jupyter Notebook interact with the operating system. In other words, it allows you to work with files and directories on your computer. It can build and inspect paths, list the contents of directories, and create or remove files and folders. More information on os can be found here: https://docs.python.org/3/library/os.html
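A minimal sketch of the kind of path handling os supports is shown here. The folder name data and the file name tafazzin_orthologs.fasta are hypothetical and only illustrate the calls; nothing is read from or written to disk.

```python
import os

# Hypothetical layout: a "data" folder next to the notebook holding the inputs.
data_dir = os.path.join(os.getcwd(), "data")
fasta_path = os.path.join(data_dir, "tafazzin_orthologs.fasta")

# Split a path into its pieces without touching the disk.
file_name = os.path.basename(fasta_path)
stem, extension = os.path.splitext(file_name)

# Check for the folder before listing it, so a missing folder is not an error.
pdb_files = []
if os.path.isdir(data_dir):
    pdb_files = sorted(f for f in os.listdir(data_dir) if f.endswith(".pdb"))

print(file_name, stem, extension, pdb_files)
```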

sys:

sys is a module in Python's standard library that allows access to certain variables and functions in the Python runtime environment. Here, we use it to get stdout and stderr (standard output and standard error). Writing to sys.stdout displays normal output. sys.stderr, on the other hand, is the stream that receives error messages, such as the traceback printed when an exception is not handled. More information on sys can be found here: https://www.geeksforgeeks.org/python-sys-module/
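The difference between the two streams can be sketched as follows. The helper name report and the messages are made up for illustration.

```python
import sys


def report(message, is_error=False):
    """Write a message to stderr for problems and to stdout otherwise."""
    stream = sys.stderr if is_error else sys.stdout
    stream.write(message + "\n")
    return stream


report("alignment finished")
report("alignment tool not found", is_error=True)
```

In a Jupyter Notebook both messages appear under the cell, with stderr output usually shown on a highlighted background.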

Bio.Align.Applications:

Bio.Align.Applications is a module in Biopython that contains command-line wrappers. The wrappers allow Jupyter Notebook to run separately installed software that carries out multiple sequence alignment. One such wrapper is MafftCommandline, which is used in this project. The module also provides wrappers for other tools, such as MUSCLE and ClustalW. Each wrapper builds the command for the corresponding software, which then carries out the alignment. More information on Bio.Align.Applications can be found here: https://biopython.org/docs/1.76/api/Bio.Align.Applications.html
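To make the idea of a command-line wrapper concrete, the sketch below builds and runs a MAFFT command with Python's standard subprocess module. This is a swapped-in illustration, not the MafftCommandline wrapper itself. The function names build_mafft_command and run_alignment and the file name are hypothetical, and the sketch assumes MAFFT is installed and reachable as mafft. The --auto option asks MAFFT to choose an alignment strategy automatically.

```python
import shutil
import subprocess


def build_mafft_command(fasta_path, executable="mafft"):
    """Return the argument list for aligning one fasta file with MAFFT."""
    return [executable, "--auto", fasta_path]


def run_alignment(fasta_path):
    """Run MAFFT and return the aligned fasta text, or None if MAFFT is missing."""
    if shutil.which("mafft") is None:
        return None
    result = subprocess.run(
        build_mafft_command(fasta_path),
        capture_output=True,  # MAFFT writes the alignment to stdout
        text=True,
        check=True,           # raise an error if MAFFT exits with a failure code
    )
    return result.stdout


# Hypothetical input file name; only the command is built here, nothing is run.
command = build_mafft_command("tafazzin_orthologs.fasta")
print(" ".join(command))
```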

tempfile:

tempfile is a module in Python's standard library that creates temporary files and directories. It is useful when intermediate files are only needed briefly and should not pile up on disk. Instead, temporary files can be generated for the step you are carrying out and deleted afterward. Its functions can create named temporary files, secure temporary files, or spooled temporary files. More information on tempfile can be found here: https://docs.python.org/3/library/tempfile.html
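A small sketch of a named temporary file is shown here. The fasta entry is a made-up example. Passing delete=False is a choice made for this sketch so that the file can be reopened by path, as an external tool would do, before it is removed explicitly.

```python
import os
import tempfile

# Hypothetical record used only for illustration.
fasta_entry = ">example_protein\nMPLHVKWPFPAV\n"

# delete=False keeps the file after the with-block so it can be read by path.
with tempfile.NamedTemporaryFile(mode="w", suffix=".fasta", delete=False) as handle:
    handle.write(fasta_entry)
    temp_path = handle.name

# Read the file back the way another program would, using its path.
with open(temp_path) as handle:
    round_trip = handle.read()

# Remove the file explicitly once it is no longer needed.
os.remove(temp_path)
file_was_removed = not os.path.exists(temp_path)

print(round_trip, file_was_removed)
```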

Bio.PDB:

Bio.PDB is a package in Biopython for structural bioinformatics. It can read PDB or mmCIF files and can retrieve structures from the Protein Data Bank directly. Additionally, it contains functions for analyzing macromolecular structure, including calculating distances or angles between atoms and superimposing two structures to compare how similar they are. More information on Bio.PDB can be found here: https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ
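The geometry behind these measurements can be sketched without any structure file. The example below is a library-free illustration, not Bio.PDB code: the function names and the three coordinates are hypothetical, and the coordinates are not taken from a real structure. In Bio.PDB itself, subtracting one Atom object from another returns the distance between them.

```python
import math


def distance(atom_a, atom_b):
    """Euclidean distance between two (x, y, z) coordinates."""
    return math.dist(atom_a, atom_b)


def angle_degrees(atom_a, atom_b, atom_c):
    """Angle at atom_b, in degrees, formed by the atoms a-b-c."""
    v1 = [a - b for a, b in zip(atom_a, atom_b)]
    v2 = [c - b for c, b in zip(atom_c, atom_b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    cosine = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


# Hypothetical coordinates in angstroms, chosen to give round numbers.
n_atom = (1.0, 0.0, 0.0)
ca_atom = (0.0, 0.0, 0.0)
c_atom = (0.0, 1.5, 0.0)

n_ca_distance = distance(n_atom, ca_atom)
n_ca_c_angle = angle_degrees(n_atom, ca_atom, c_atom)
print(n_ca_distance, n_ca_c_angle)
```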

nglview:

nglview is a Jupyter widget package, separate from Biopython, for viewing and interacting with 3D protein structures. Once a structure is loaded, you can rotate, zoom in on, and move the protein around to study its structure. You can also change how the protein is displayed: you can view it in cartoon format, change its color, show the hydrogens, and much more. An image of the current view can also be downloaded and displayed. More information can be found here: https://github.com/nglviewer/nglview

Bio.Phylo.TreeConstruction:

Bio.Phylo.TreeConstruction allows us to do phylogenetic clustering. The module builds phylogenetic trees from a multiple sequence alignment using neighbor joining or the Unweighted Pair Group Method with Arithmetic Mean (UPGMA). It can calculate a distance matrix holding the pairwise distances between the proteins being analyzed. Additionally, the resulting tree can be displayed and saved in different file formats with Bio.Phylo. More information on this module can be found here: https://biopython.org/wiki/Phylo

Bio.SeqUtils.ProtParam:

Bio.SeqUtils.ProtParam is a module for protein sequence analysis. It can count the number and percentage of each amino acid and calculate hydropathy, aromaticity, and much more. It can also calculate the isoelectric point of the protein or its charge at a given pH. More information can be found here: https://biopython.org/docs/1.76/api/Bio.SeqUtils.ProtParam.html

Bioinformatics Analysis

Multiple Sequence Alignment

Multiple sequence alignment is a bioinformatics technique for comparing three or more DNA, RNA, or protein sequences to find similarities and maximize matching between them. It can help identify structural or functional components of a novel protein and trace evolutionary relationships between the sequences being analyzed.
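Once sequences are aligned, conservation can be read column by column. The sketch below is a simplified illustration that scores each column by the fraction of sequences sharing its most common residue. The three aligned sequences and the function name column_conservation are made up for the example and are not tafazzin data.

```python
from collections import Counter

# Hypothetical pre-aligned toy sequences; "-" marks a gap.
aligned = {
    "species_a": "MPLHVK-WPF",
    "species_b": "MPLHVRAWPF",
    "species_c": "MPIHVK-WPY",
}


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    sequences = list(alignment.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*sequences):
        residues = [r for r in column if r != "-"]  # gaps never count as matches
        top_count = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top_count / len(sequences))
    return scores


scores = column_conservation(aligned)
fully_conserved = [i for i, score in enumerate(scores) if score == 1.0]
print(fully_conserved)
```

Runs of fully conserved columns are the kind of conserved spans that may point to a shared functional domain.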

In the code below, we read in sequences from a fasta file and load them with SeqIO and AlignIO to align the sequences. We also create a temporary file for each sequence in the fasta file, convert each individual sequence into a sequence record with SeqIO, then loop over the sequences to perform multiple sequence alignment using MafftCommandline.

Homology Modeling and Structural Bioinformatics

Homology modeling is a bioinformatics method that constructs a 3D protein structure using the sequences and structures of similar proteins. When proteins share over 20% amino acid identity, their sequences can be compared to construct a model for a protein that otherwise may not have a predicted structure. Structural bioinformatics allows us to view and explore these 3D protein structures and to change how the proteins are displayed.
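The identity threshold can be checked with a short calculation on two aligned sequences. This is a minimal sketch under stated assumptions: the two sequences and the function name percent_identity are hypothetical, and identity is counted only over columns where neither sequence has a gap, which is one of several definitions in use. The 20% cutoff is the figure given above.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned target and template sequences.
target = "MPLHVK-WPFAV"
template = "MPIHVRAWPF-V"

identity = percent_identity(target, template)
usable_as_template = identity > 20.0
print(identity, usable_as_template)
```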

In the code below, information from PDB files was used to generate 3D protein structures of different tafazzin orthologs as well as other mitochondrial acyltransferases. The PDB files contain the atomic coordinates for each of the displayed proteins.

Plotting The Results

Phylogenetic Clustering

Phylogenetic clustering is an analysis method that compares sequences to see how closely related a group of subjects is. In the code below, Biopython calculates the distance between the different protein sequences in the fasta file containing the tafazzin sequences. Then, these calculated values are used to construct a phylogenetic tree that visually depicts how related the chosen proteins are and where they diverged.
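The two steps, a distance matrix followed by tree building, can be sketched in plain Python. This is an illustrative UPGMA implementation on toy data, not the Bio.Phylo.TreeConstruction code used in the notebook. The four short sequences are invented (the species labels are only names and carry no real tafazzin data), the function names are hypothetical, and the distance is the simple fraction of differing residues.

```python
from itertools import combinations

# Hypothetical pre-aligned toy sequences of equal length.
aligned = {
    "human": "MPLHVKWPFA",
    "chimp": "MPLHVKWPFS",
    "mouse": "MPLHIRWPFS",
    "yeast": "MALNIRWAFS",
}


def p_distance(seq_a, seq_b):
    """Fraction of differing residues over columns without gaps."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 1.0
    return sum(a != b for a, b in pairs) / len(pairs)


def upgma_newick(alignment):
    """Cluster aligned sequences with UPGMA and return the topology as Newick."""
    # Each cluster is keyed by its Newick label and stores its number of leaves.
    sizes = {name: 1 for name in alignment}
    dist = {
        frozenset((a, b)): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }
    while len(sizes) > 1:
        # Join the closest pair of clusters.
        closest = min(dist, key=dist.get)
        a, b = sorted(closest, key=list(sizes).index)
        merged = f"({a},{b})"
        # The distance to the new cluster is the size-weighted mean
        # of the distances to the two clusters it replaces.
        for other in sizes:
            if other in (a, b):
                continue
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            dist[frozenset((merged, other))] = (
                sizes[a] * d_a + sizes[b] * d_b
            ) / (sizes[a] + sizes[b])
        del dist[closest]
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"


tree = upgma_newick(aligned)
print(tree)
```

The Newick string lists the most similar sequences in the innermost parentheses, so sequences that were joined first sit deepest in the nesting. Branch lengths are left out to keep the sketch short.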

3D Protein Measurement

3D protein measurement calculates properties of the proteins whose structures were generated. Here, we measure the molecular weight, average hydropathy, number and percentage of histidine/aspartate residues, percent composition of each secondary structure, aromaticity, instability index, isoelectric point, and molar extinction coefficient of the chosen proteins.
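A few of these measurements are simple enough to sketch by hand. The example below is a library-free illustration rather than the Bio.SeqUtils.ProtParam code used in the notebook: it computes average hydropathy (GRAVY) from the Kyte-Doolittle scale, aromaticity as the fraction of Phe, Trp, and Tyr residues, and the percentage of a chosen residue. The ten-residue sequence and the function names are made up for the example.

```python
# Kyte-Doolittle hydropathy value for each amino acid (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Average hydropathy: mean Kyte-Doolittle value over all residues."""
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


def aromaticity(sequence):
    """Fraction of residues that are Phe, Trp, or Tyr."""
    return sum(sequence.count(residue) for residue in "FWY") / len(sequence)


def residue_percent(sequence, residue):
    """Percentage of the sequence made up of one residue type."""
    return 100.0 * sequence.count(residue) / len(sequence)


# Hypothetical ten-residue sequence, not a real protein.
toy_protein = "MPLHVKWPFD"

toy_gravy = gravy(toy_protein)
toy_aromaticity = aromaticity(toy_protein)
histidine_percent = residue_percent(toy_protein, "H")
aspartate_percent = residue_percent(toy_protein, "D")
print(round(toy_gravy, 3), toy_aromaticity, histidine_percent, aspartate_percent)
```

A negative GRAVY value means the sequence is, on average, more hydrophilic than hydrophobic.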

Analyzing the Results

Multiple sequence alignment shows that there are conserved spans of amino acid residues in the tafazzin protein as well as in other mitochondrial acyltransferases. Structural bioinformatics shows that the acyltransferases all have a loop structure (positioned at the lower left corner of the displayed images). This loop may correspond to the cleft described by Hijikata et al. (2015), where the substrate (acyl chains) would enter to be positioned and transferred to the phospholipid.

Additionally, the phylogenetic tree shows how related the human tafazzin protein is to its orthologs and other acyltransferases. The long isoform of human tafazzin is least related to the fly and yeast tafazzin proteins. It is also least similar to the other acyltransferases that were analyzed in this code. However, it seems that tafazzin from zebrafish, chimps, mice, and flies and the exon 5 deletion of human tafazzin all share a common ancestor that had the human tafazzin long isoform. The mouse, chimp, and fly tafazzin share a common ancestor that may have had the zebrafish tafazzin.

3D protein measurement shows that the human exon 5 deletion, mouse, and chimp tafazzin orthologs have similar hydropathy of around -0.1. However, the human full-length, zebrafish, and fly tafazzin orthologs have a hydropathy of around -0.2, roughly twice as negative. Interestingly, the yeast ortholog has a hydropathy of around -0.5, roughly five times as negative as that of the human exon 5 deletion tafazzin protein. Despite these differences, the orthologs have similar percentages of histidine and arginine residues, except for the fly and yeast orthologs, which had markedly fewer histidine residues and markedly more aspartate residues than the others. Despite the differences in residues, all orthologs and even the other acyltransferases have a similar composition of secondary structures. Additionally, the aromaticity, instability index, and isoelectric point are similar in value across homologs.

Taken together, these results support my hypothesis that important functional domains of tafazzin, such as the acyltransferase domain, are highly conserved across species and similar to other phospholipid acyltransferases: the amino acid sequences, protein structures, phylogenetic analysis, and 3D protein measurements all show similarities and connections between tafazzin orthologs and related acyltransferases.